Abstract: We consider the concept of a weighted 2-stem
similarity  function  between  two  DNA  sequences  and  discuss  DNA 
codes  based  on  this  similarity.  An  optimal  construction  of  such 
codes  is  suggested.  A  random  coding  bound  on  the  rate  of  DNA 
codes  is  proved.  To  obtain  the  bound,  we  use  some  ensembles 
of  DNA  sequences  which  are  generalizations  of  the  Fibonacci 
sequences. 

I.  Introduction 

In  order  to  accomplish  DNA  computing,  it  is  necessary 
to  have  DNA  libraries,  also  known  as  DNA  codes,  of  large 
size  and  small  energies  of  hybridization  between  the  DNA 
sequences.  The  ultimate  criterion  for  the  value  of  a  similarity 
for  DNA  codes  is  the  degree  to  which  it  approximates  actual 
bonding  energies,  which  in  turn  determines  the  degree  to 
which  similarity  approximates  the  likelihood  of  one  codeword 
mistakenly  binding  to  the  reverse  complement  of  another 
codeword. We can use a branch of mathematics known as
coding theory, which was initiated around the same time that
the  structure  of  DNA  was  discovered,  to  study  the  space  of 
DNA  sequences  endowed  with  a  measure  of  similarity.  The 
introduced  measure  of  similarity  between  DNA  sequences 
has  an  immediate  application  in  determining  the  similarities 
between  genes,  expressed  as  DNA  sequences,  in  any  existing 
genome.  Codes  built  on  spaces  of  DNA  sequences  can  be 
implemented  in  Biomolecular  Computing  and  could  have  other 
important applications. A conventional similarity function for
measuring codeword similarity is the well known deletion similarity,
i.e., the length of a longest common subsequence [7].
The works of D’yachkov et al. [2], [3], [4] suggest using
the  length  of  a  longest  common  block  subsequence,  which 
imposes  an  additional  adjacency  requirement,  with  the  goal  of 
modeling  actual  bonding  energies.  In  this  paper,  we  introduce 
the  concept  of  a  stem  similarity  function  which  provides  a 
more  accurate  estimation  [1],  [2]  of  the  hybridization  energy. 

II.  Statement  of  Problem 
A.  Notations  and  Auxiliary  Definitions 

The symbol $\triangleq$ denotes definitional equalities and the symbol
$[n] \triangleq \{1, 2, \ldots, n\}$ denotes the set of integers from 1 to $n$.

°The  work  was  supported  by  AFOSR  -  FA8750-07-C-0089 


Let $\{A, C, G, T\}$ be the standard DNA alphabet. For any letter
$x \in \{A, C, G, T\}$, we define

$$\bar{x} \triangleq \begin{cases} T & \text{if } x = A, \\ G & \text{if } x = C, \\ C & \text{if } x = G, \\ A & \text{if } x = T, \end{cases}$$

which is called the complement of the letter $x$. This means
that the DNA alphabet $\{A, C, G, T\}$ consists of two pairs
of mutually complementary letters: $\bar{A} = T$, $\bar{T} = A$ and
$\bar{C} = G$, $\bar{G} = C$.

Let $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, where
$x, y \in \{A, C, G, T\}^n$, be two arbitrary DNA $n$-sequences. By
symbol $z = (z_1, z_2, \ldots, z_\ell) \in \{A, C, G, T\}^\ell$, $\ell \in [n]$, we will
denote a common subsequence [7] of length $|z| = \ell$ between $x$
and $y$. The empty subsequence $z$ of length $|z| = 0$ is a common
subsequence between any sequences $x$ and $y$.

Definition 1. Let $2 \le r \le n$ be an arbitrary integer. A
fixed DNA $r$-sequence $a = (a_1, a_2, \ldots, a_r) \in \{A, C, G, T\}^r$
is called a common block for sequences $x$ and $y$ (briefly, common
$(x,y)$-block) of length $r$ if sequences $x$ and $y$ (simultaneously)
contain $a$ as a subsequence consisting of $r$ consecutive
elements of $x$ and $y$. We will say that a common $(x,y)$-block $a$
yields $r - 1$ common 2-stems $(a_i, a_{i+1})$, $i \in [r - 1]$, containing
2 adjacent symbols of the given common $(x,y)$-block.

Definition 2. Let $2 \le \ell \le n$ be an integer. A sequence
$z = (z_1, z_2, \ldots, z_\ell) \in \{A, C, G, T\}^\ell$ is called a common block
subsequence of length $|z| = \ell$ between $x$ and $y$ if $z$ is
an ordered collection of non-overlapping (separated) common
$(x,y)$-blocks and the length of each common $(x,y)$-block in
this collection is $\ge 2$. Let $Z(x,y)$ be the set of all common
block subsequences between $x$ and $y$. For any $z \in Z(x,y)$,
we denote by $k(z,x,y)$, $1 \le k(z,x,y) \le |z|/2$, the minimal
number of common $(x,y)$-blocks which constitute the given
subsequence $z$.

Note that the difference $|z| - k(z,x,y)$, $z \in Z(x,y)$, is the total
number of common 2-stems containing adjacent symbols in the
common $(x,y)$-blocks constituting $z \in Z(x,y)$.

Definition 3. [1], [2] For sequences $x, y \in \{A, C, G, T\}^n$,
the number

$$S(x,y) \triangleq \max_{z \in Z(x,y)} \{|z| - k(z,x,y)\}, \qquad S(x,y) \ge 0, \qquad (1)$$

is called a 2-stem similarity between $x$ and $y$. Obviously,
$S(x,y) = S(y,x) \le S(x,x) = n - 1$.

For any $x = (x_1, x_2, \ldots, x_n) \in \{A, C, G, T\}^n$, we introduce
its reverse complement (Watson-Crick transformation)

$$\tilde{x} \triangleq (\bar{x}_n, \bar{x}_{n-1}, \ldots, \bar{x}_2, \bar{x}_1) \in \{A, C, G, T\}^n. \qquad (2)$$

If $y = \tilde{x}$, then $x = \tilde{y}$ for any $x \in \{A, C, G, T\}^n$. If $x = \tilde{x}$,
then $x$ is called a self reverse complementary sequence. If
$x \ne \tilde{x}$, then a pair $(x, \tilde{x})$ is called a pair of mutually reverse
complementary sequences.
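
As a minimal sketch of transformation (2), the following Python snippet computes the reverse complement of a DNA sequence stored as a string. The names `COMPLEMENT`, `reverse_complement` and `is_self_reverse_complementary` are illustrative assumptions and do not come from the paper.

```python
# Letter-wise complement of the DNA alphabet: A <-> T, C <-> G.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x: str) -> str:
    """Reverse complement (2): complement every letter and reverse the order."""
    return "".join(COMPLEMENT[a] for a in reversed(x))


def is_self_reverse_complementary(x: str) -> bool:
    """True if x coincides with its own reverse complement."""
    return reverse_complement(x) == x
```

Applying `reverse_complement` twice returns the original string, which mirrors the statement that $y$ being the reverse complement of $x$ implies $x$ is the reverse complement of $y$.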

Example. Let $n = 10$ and

$$x = (A, T, T, A, A, A, A, T, T, A),$$
$$y = \tilde{x} = (T, A, A, T, T, T, T, A, A, T).$$

A common block subsequence $z$ between $x$ and $y = \tilde{x}$ is

$$z = (T, A, A, T, T, A) = \tilde{z} = (x_3, x_4, x_5, x_8, x_9, x_{10}) = (y_1, y_2, y_3, y_6, y_7, y_8) \in Z(x,y).$$

The value $k(z,x,y) = 2$ and the corresponding 2-stem
similarity is

$$S(x,y) = \max_{z \in Z(x,y)} \{|z| - k(z,x,y)\} = 6 - 2 = 4.$$

The maximal value is achieved for the above self reverse
complementary sequence $z \in Z(x,y)$.
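
The maximum in (1) can be evaluated by dynamic programming. The sketch below is one possible way to do this and is not taken from the paper: it scans all order-preserving alignments of equal letters and counts the largest number of adjacent matched pairs, which equals $|z| - k(z,x,y)$ for the best common block subsequence. The function name and table names are assumptions.

```python
def stem_similarity(x: str, y: str) -> int:
    """2-stem similarity (1) for the uniform weight, via dynamic programming."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    # end[i][j]: most 2-stems in an alignment that matches x[i-1] with y[j-1]
    #            (NEG if the two letters differ).
    # best[i][j]: most 2-stems using only the prefixes x[:i] and y[:j].
    end = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                value = best[i - 1][j - 1]          # start a new block here
                if end[i - 1][j - 1] != NEG:        # or extend the current block
                    value = max(value, end[i - 1][j - 1] + 1)
                end[i][j] = value
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


x = "ATTAAAATTA"
y = "TAATTTTAAT"  # reverse complement of x
print(stem_similarity(x, y))
```

A matched letter that starts a block contributes no 2-stem by itself, so allowing isolated matches in the alignment does not change the maximum.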


B.  Weighted  Stem  Similarity  and  Distance 

Let $w = w(a,b) > 0$, $a, b \in \{A, C, G, T\}$, be a weight
function such that

$$w(a,b) = w(\bar{b}, \bar{a}), \qquad a, b \in \{A, C, G, T\}. \qquad (3)$$

Condition (3) means that $w(a,b)$ is an invariant function under
the Watson-Crick transformation.

Definition 4. [1], [2] Let $z \in Z(x,y)$ have the form

$$|z| = \sum_{m=1}^{k(z,x,y)} |z^m| = \sum_{m=1}^{k(z,x,y)} r_m,$$

where

$$z^m = (z_1^m, z_2^m, \ldots, z_{r_m}^m) \in \{A, C, G, T\}^{r_m}, \qquad m = 1, 2, \ldots, k(z,x,y),$$

is an ordered collection of common $(x,y)$-blocks constituting $z$
and $r_m = |z^m| \ge 2$ is the length of block $z^m$. For DNA
sequences $x, y \in \{A, C, G, T\}^n$, the number

$$S^{(w)}(x,y) \triangleq \max_{z \in Z(x,y)} \sum_{m=1}^{k(z,x,y)} \sum_{i=1}^{r_m - 1} w(z_i^m, z_{i+1}^m) \qquad (4)$$

is called a weighted 2-stem similarity between $x$ and $y$. We will
say that $S^{(w)}(x,y) = 0$ if and only if the set $Z(x,y) = \emptyset$.

Function $S^{(w)}(x,y)$ is used to model [2], [3], [4] a thermodynamic
similarity (hybridization energy) between DNA
sequences $x$ and $y$.

Proposition 1. For any $x, y \in \{A, C, G, T\}^n$, the function

$$S^{(w)}(x,y) = S^{(w)}(y,x) \le S^{(w)}(x,x). \qquad (5)$$

In addition,

$$S^{(w)}(x, \tilde{y}) = S^{(w)}(y, \tilde{x}), \qquad x, y \in \{A, C, G, T\}^n. \qquad (6)$$

The symmetry property and inequality (5) are evident.
Equality (6) follows from definitions (2), (4) and condition (3).
Identity (6) means the symmetry property of hybridization
energy between DNA sequences $x$ and $y$ [2], [4].

One can easily check that the 2-stem similarity $S(x,y)$ from
Definition 3 corresponds to the uniform weight function:
$w(a,b) = 1$ for any $a, b \in \{A, C, G, T\}$. Table 1 shows an
example [2] of values for $w(a,b)$ which satisfy (3) and have
a significant biological motivation:

w(a,b)   b = A   b = C   b = G   b = T
a = A    1.02    1.46    1.29    0.88
a = C    1.46    1.83    2.17    1.29
a = G    1.32    2.24    1.83    1.46
a = T    0.60    1.32    1.46    1.02

Table 1.
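
The invariance of the Table 1 weights can be checked mechanically. The sketch below assumes that condition (3) reads $w(a,b) = w(\bar{b}, \bar{a})$, i.e., the weight of a 2-stem equals the weight of its reverse complement; the dictionary layout `W[a][b]` and the function name are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Table 1: W[a][b] = w(a, b), first letter a (row), second letter b (column).
W = {
    "A": {"A": 1.02, "C": 1.46, "G": 1.29, "T": 0.88},
    "C": {"A": 1.46, "C": 1.83, "G": 2.17, "T": 1.29},
    "G": {"A": 1.32, "C": 2.24, "G": 1.83, "T": 1.46},
    "T": {"A": 0.60, "C": 1.32, "G": 1.46, "T": 1.02},
}


def satisfies_condition_3(w) -> bool:
    """Check w(a, b) == w(complement(b), complement(a)) for all 16 letter pairs."""
    return all(
        w[a][b] == w[COMPLEMENT[b]][COMPLEMENT[a]]
        for a in "ACGT"
        for b in "ACGT"
    )


print(satisfies_condition_3(W))
```

Note that the table is not symmetric in the ordinary sense, for example $w(A,G) = 1.29$ while $w(G,A) = 1.32$, so the check compares each entry with the entry of the reverse complementary 2-stem.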

Definition 5. [1] The number

$$\mathcal{D}^{(w)}(x,y) \triangleq S^{(w)}(x,x) - S^{(w)}(x,y) \qquad (7)$$

is called a weighted 2-stem distance between $x$ and $y$.

Typically, $\mathcal{D}^{(w)}(x,y) \ne \mathcal{D}^{(w)}(y,x)$, i.e., function (7) is not
symmetric. Proposition 1 gives:

$$\mathcal{D}^{(w)}(x,y) \ge \mathcal{D}^{(w)}(x,x) = 0. \qquad (8)$$
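
To illustrate that distance (7) is not symmetric, the following sketch evaluates the weighted 2-stem similarity with the Table 1 weights by dynamic programming. It is an illustrative implementation under the assumption that each 2-stem $(a,b)$ of a common block contributes $w(a,b)$; the function names are not from the paper.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
W = {  # Table 1: W[a][b] = w(a, b)
    "A": {"A": 1.02, "C": 1.46, "G": 1.29, "T": 0.88},
    "C": {"A": 1.46, "C": 1.83, "G": 2.17, "T": 1.29},
    "G": {"A": 1.32, "C": 2.24, "G": 1.83, "T": 1.46},
    "T": {"A": 0.60, "C": 1.32, "G": 1.46, "T": 1.02},
}


def reverse_complement(x):
    return "".join(COMPLEMENT[a] for a in reversed(x))


def weighted_stem_similarity(x, y, w=W):
    """Weighted 2-stem similarity: best total weight of common 2-stems."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    end = [[NEG] * (m + 1) for _ in range(n + 1)]    # alignment ends at (i, j)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]   # best over prefixes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                value = best[i - 1][j - 1]           # start a new block
                if end[i - 1][j - 1] != NEG:         # extend block by one 2-stem
                    stem = w[x[i - 2]][x[i - 1]]
                    value = max(value, end[i - 1][j - 1] + stem)
                end[i][j] = value
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


def weighted_stem_distance(x, y, w=W):
    """Distance (7): S(x, x) - S(x, y)."""
    return weighted_stem_similarity(x, x, w) - weighted_stem_similarity(x, y, w)


print(weighted_stem_distance("AAAA", "CCCC"), weighted_stem_distance("CCCC", "AAAA"))
```

For $x = (A,A,A,A)$ and $y = (C,C,C,C)$ there are no common 2-stems, so the two distances reduce to $3 \cdot w(A,A)$ and $3 \cdot w(C,C)$, which differ.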

C.  DNA  Codes  based  on  Stem  Similarity 

Let $x(j) = (x_1(j), x_2(j), \ldots, x_n(j)) \in \{A, C, G, T\}^n$,
$j \in [N]$, be codewords of a code $X = \{x(1), x(2), \ldots, x(N)\}$
of length $n$ and size $N$, where $N = 2, 4, \ldots$ is an even integer.
Let $D$, $0 < D \le \max_{x} S^{(w)}(x,x)$, be an arbitrary positive
number. Taking into account (7) and (8), we give

Definition 6. A code $X$ is called a DNA $(n, D, w)$-code
based on weighted 2-stem similarity $S^{(w)}(x,y)$ (briefly,
$(n, D, w)$-code) if the following two conditions are fulfilled.
(i) For any number $j \in [N]$ there exists $j' \in [N]$, $j' \ne j$,
such that $x(j') = \widetilde{x(j)} \ne x(j)$. In other words, $X$ is a
collection of $N/2$ pairs of mutually reverse complementary
sequences. (ii) For any $j, j' \in [N]$, where $j \ne j'$, the
distance $\mathcal{D}^{(w)}(x(j), x(j')) \ge D$.
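
The two conditions of Definition 6 translate directly into a membership check. The sketch below is a hypothetical helper, not part of the paper: it defaults to the uniform weight $w = 1$, for which the distance equals $(n-1) - S(x,y)$, and accepts any other distance function through the assumed `distance` parameter.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(x):
    return "".join(COMPLEMENT[a] for a in reversed(x))


def stem_similarity(x, y):
    """2-stem similarity for the uniform weight (dynamic programming sketch)."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    end = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                value = best[i - 1][j - 1]
                if end[i - 1][j - 1] != NEG:
                    value = max(value, end[i - 1][j - 1] + 1)
                end[i][j] = value
            best[i][j] = max(best[i - 1][j], best[i][j - 1], end[i][j])
    return best[n][m]


def uniform_distance(x, y):
    """Distance (7) for w = 1: S(x, x) - S(x, y) = (n - 1) - S(x, y)."""
    return (len(x) - 1) - stem_similarity(x, y)


def is_dna_code(codewords, D, distance=uniform_distance):
    words = set(codewords)
    # (i) each codeword differs from its reverse complement, which is in the code
    for x in words:
        rc = reverse_complement(x)
        if rc == x or rc not in words:
            return False
    # (ii) every ordered pair of distinct codewords is at distance >= D
    return all(distance(x, y) >= D for x in words for y in words if x != y)


print(is_dna_code(["AACC", "GGTT"], 3))
```

Because the weighted distance is in general not symmetric, condition (ii) is tested on ordered pairs.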

The  following  statement  is  obvious. 

Proposition 2. Let (3) be the uniform weight function, i.e.,

$$w(a,b) = 1, \qquad a, b \in \{A, C, G, T\}.$$

The corresponding symmetric distance function $\mathcal{D}^{(=1)}(x,y)$,
$x, y \in \{A, C, G, T\}^n$, has the form

$$\mathcal{D}^{(=1)}(x,y) = \mathcal{D}^{(=1)}(y,x) = (n - 1) - S(x,y), \qquad (9)$$

where the 2-stem similarity $S(x,y)$ is defined by (1), and the
definition of a DNA $(n, D, =1)$-code, $0 < D \le n - 1$, is
identified by the inequality

$$S(x(j), x(j')) \le (n - 1) - D, \qquad j, j' \in [N], \quad j \ne j'. \qquad (10)$$

Definition 7. Let $N^{(w)}(n, D)$ be the maximal size of
DNA $(n, D, w)$-codes based on weighted 2-stem similarity. If
$d > 0$ is a fixed number, then

$$R^{(w)}(d) \triangleq \overline{\lim_{n \to \infty}} \frac{\log_4 N^{(w)}(n, nd)}{n} \qquad (11)$$

is called a rate of DNA $(n, nd, w)$-codes for a distance
fraction $d$.

D. Construction of DNA (n, 2, =1)-codes

In papers [3], [4], we introduced the following definitions.

1) A common subsequence $z = (z_1, \ldots, z_\ell)$, $2 \le \ell \le n$,
is called a common block subsequence of length $|z| = \ell$
between $x$ and $y$ if any two consecutive elements $z_m, z_{m+1}$,
$m = 1, 2, \ldots, \ell - 1$, which are consecutive (separated) in $x$
are also consecutive (separated) in $y$ and vice versa, i.e.,

$$(z_m = x_{i_m},\ z_{m+1} = x_{i_m + 1}) \iff (z_m = y_{j_m},\ z_{m+1} = y_{j_m + 1}).$$

Let $S^{\beta}(x,y)$, $0 \le S^{\beta}(x,y) \le n$, denote the length $|z|$ of
a longest sequence occurring as a common block subsequence $z$
between sequences $x$ and $y$. The number $S^{\beta}(x,y) = S^{\beta}(y,x)$
is called a block similarity between $x$ and $y$.

2) Let $D$, $1 \le D \le n - 1$, be an arbitrary integer. A code
$X$ is called a DNA $(n, D)$-code based on block similarity
$S^{\beta}(x,y)$ if the following two conditions are fulfilled. (i) For
any number $j \in [N]$ there exists $j' \in [N]$, $j' \ne j$, such that
$x(j') = \widetilde{x(j)} \ne x(j)$. (ii) For any $j, j' \in [N]$, where $j \ne j'$,
the block similarity $S^{\beta}(x(j), x(j')) \le n - D - 1$. For given $n$
and $D$, we denote by $N^{\beta}(n, D)$ the maximal size of $(n, D)$-codes
based on block similarity.

Let $x, y \in \{A, C, G, T\}^n$ be arbitrary DNA sequences. One
can easily see that block similarity $S^{\beta}(x,y) = n - 2$ iff the
corresponding 2-stem similarity $S(x,y) = n - 3$. Therefore,
from (9)-(10) it follows that the definition of a DNA $(n, 1)$-code
based on block similarity is equivalent to the definition
of a DNA $(n, 2, =1)$-code based on 2-stem similarity. This
means that $N^{\beta}(n, 1) = N^{(=1)}(n, 2)$. Hence, the main result
of paper [4] about constructions of optimal DNA codes based
on block similarity leads to

Theorem 1. If $n = 4m$, $m = 1, 3, 5, \ldots$, then


A(=1)(?i,  2) 


4”-1  +  4 
E. DNA Codes for Fibonacci Ensembles


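Let $L$ be a collection of 2-strings of DNA letters, closed
under the reverse complement transformation. For instance,

$$L = \emptyset, \quad L = \{TA\}, \quad L = \{TA, AT\}, \quad L = \{TA, AT, AA, TT\}. \qquad (12)$$

Denote by $DNA(n, L)$ (briefly, $[n, L]$) the set (ensemble) of
all DNA sequences of length $n$ which do not contain 2-stems from $L$. We
will say that $[n, L]$ is the Fibonacci $L$-ensemble¹. Denote by
$\lambda_L(n) \triangleq |DNA(n, L)| = |[n, L]|$ the cardinality of $[n, L]$.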
Definition 8. Let $N_L(n, D)$ be the maximal size of DNA
$(n, D, =1)$-codes $X \subset DNA(n, L)$. If the distance fraction
$d > 0$ is a fixed number, then

$$R_L(d) \triangleq \overline{\lim_{n \to \infty}} \frac{\log_4 N_L(n, nd)}{n} \qquad (13)$$

is called a rate of DNA codes for the Fibonacci $L$-ensemble.
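For a weight function (3), introduce numbers

$$w_L \triangleq \min_{(a,b) \notin L} w(a,b). \qquad (14)$$

For instance, if the values of $w = w(a,b)$ are given by Table 1,
then

$$w_L = \begin{cases} 0.60 & \text{if } L = \emptyset, \\ 0.88 & \text{if } L = \{TA\}, \\ 1.02 & \text{if } L = \{TA, AT\}, \\ 1.29 & \text{if } L = \{TA, AT, AA, TT\}. \end{cases} \qquad (15)$$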
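
The four values above can be reproduced from Table 1. The sketch below assumes that the minimum in (14) runs over the 2-stems $(a,b)$ that are not in $L$, which is the reading consistent with the listed numbers; the names `ENSEMBLES` and `min_weight_outside` are illustrative.

```python
W = {  # Table 1: W[a][b] = w(a, b)
    "A": {"A": 1.02, "C": 1.46, "G": 1.29, "T": 0.88},
    "C": {"A": 1.46, "C": 1.83, "G": 2.17, "T": 1.29},
    "G": {"A": 1.32, "C": 2.24, "G": 1.83, "T": 1.46},
    "T": {"A": 0.60, "C": 1.32, "G": 1.46, "T": 1.02},
}

# The four sets L from (12), each written as a set of 2-letter strings.
ENSEMBLES = [set(), {"TA"}, {"TA", "AT"}, {"TA", "AT", "AA", "TT"}]


def min_weight_outside(L, w=W):
    """w_L from (14): smallest weight over the 2-stems (a, b) not in L."""
    return min(w[a][b] for a in "ACGT" for b in "ACGT" if a + b not in L)


for L in ENSEMBLES:
    print(sorted(L), min_weight_outside(L))
```

Forbidding the lightest 2-stems raises $w_L$, which is the mechanism by which the Fibonacci ensembles improve the weighted distance guarantee.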


One can easily check [1] that the distance

$$\mathcal{D}^{(w)}(x,y) \ge w_L \cdot \mathcal{D}^{(=1)}(x,y) \quad \text{if } x, y \in DNA(n, L).$$


In virtue of (9) and (10), this gives

Proposition 3. Let $w_L$ be a number defined by (14) and a
code $X \subset DNA(n, L)$. If $X$ is a DNA $(n, D, =1)$-code, then
$X$ is a DNA $(n, w_L \cdot D, w)$-code. Hence, rate (11) satisfies
the inequality

$$R^{(w)}(d) \ge \max_{L} R_L\!\left(\frac{d}{w_L}\right), \qquad (16)$$

where $R_L(d)$ is defined by (13).

In the rest of the paper, we obtain a random coding bound
on $R_L(d)$ for $L$ defined by (12). Then, applying (16), we get a
random coding bound on the rate $R^{(w)}(d)$ of DNA $(n, nd, w)$-codes
based on weighted 2-stem similarity.


III.  Random  Coding  Bounds 
A.  On  Cardinalities  of  Fibonacci  L  -Ensembles 

If $L = \emptyset$, then $\lambda_L(n) = 4^n$. If $L \ne \emptyset$, then cardinalities
$\lambda_L(1) = 4$ and $\lambda_L(2) = 16 - |L|$ are given. For sets $L$ defined
by (12), we calculate cardinalities $\lambda_L(n)$, $n = 3, 4, \ldots$, using
the following well known result from the theory of recurrent
sequences.

Proposition 4. Let $f_1 \ne 0$ and $f_2 \ne 0$ be arbitrary fixed
numbers. If sequence $\lambda_L(n)$, $n = 3, 4, \ldots$, satisfies recurrent
equation

$$\lambda_L(n) = f_1 \lambda_L(n - 1) + f_2 \lambda_L(n - 2), \qquad (17)$$

then

$$\lambda_L(n) = C_1 r_1^n + C_2 r_2^n, \qquad n = 1, 2, \ldots, \qquad (18)$$

where $r_1 = r_1(L)$ and $r_2 = r_2(L)$ are roots of the characteristic
equation $r^2 - f_1 r - f_2 = 0$ and $C_1 = C_1(L)$, $C_2 = C_2(L)$
are calculated from initial conditions: $4 = C_1 r_1 + C_2 r_2$,
$16 - |L| = C_1 r_1^2 + C_2 r_2^2$.

¹ Here $DNA(n, L)$ (briefly, $[n, L]$) is the ensemble of all DNA sequences
which do not contain 2-stems from $L$. Binary 0,1-sequences which do not
contain 2-stems of the form (1,1) are known as the Fibonacci sequences [6].
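Formula (18), obviously, leads to

Proposition 5. If $r_1, r_2$ are real numbers, $r_1 > 0$ and
$r_1 > |r_2|$, then $\lambda_L(n)$, $n = 1, 2, \ldots$, satisfies inequalities

$$C r^n [1 - \beta \alpha^n] \le \lambda_L(n) \le C r^n [1 + \beta \alpha^n], \qquad (19)$$

where

$$r \triangleq r_1 = \max\{r_1, r_2\}, \quad C \triangleq C_1, \quad \alpha \triangleq \frac{|r_2|}{r_1} < 1, \quad \beta \triangleq \frac{|C_2|}{C_1}. \qquad (20)$$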


Remark. For the case $L = \emptyset$, bounds (19) will be true as
well (with the sign of equality) if we formally define $r_1 = 4$,
$C_1 = 1$ and $r_2 = C_2 = 0$, i.e., $C = 1$, $r = 4$ and $\alpha = \beta = 0$.

Lemma 1. If $L = \{TA\}$, then $\lambda_L(n)$ satisfies (17), where
$f_1 = 4$, $f_2 = -1$. Hence, parameters (20) of bounds (19) are:

$$r = 2 + \sqrt{3} = 3.73, \quad C = \frac{3 + 2\sqrt{3}}{6} = 1.08, \quad \alpha = \beta = 7 - 4\sqrt{3} = .0718. \qquad (21)$$

Lemma 2. If $L = \{TA, AT\}$, then $\lambda_L(n)$ satisfies (17),
where $f_1 = 3$, $f_2 = 2$. Hence, parameters (20) of bounds (19)
are:

$$r = \frac{3 + \sqrt{17}}{2} = 3.56, \quad C = \frac{17 + 5\sqrt{17}}{34} = 1.11, \quad \alpha = \frac{13 - 3\sqrt{17}}{4} = .158, \quad \beta = \frac{21 - 5\sqrt{17}}{4} = .0961. \qquad (22)$$

Lemma 3. If $L = \{TA, AT, AA, TT\}$, then $\lambda_L(n)$ satisfies
(17), where $f_1 = 2$, $f_2 = 4$. Hence, parameters (20) of
bounds (19) are:

$$r = 1 + \sqrt{5} = 3.24, \quad C = \frac{5 + 3\sqrt{5}}{10} = 1.17, \quad \alpha = \frac{3 - \sqrt{5}}{2} = .382, \quad \beta = \frac{7 - 3\sqrt{5}}{2} = .146. \qquad (23)$$
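
The numerical parameters in Lemmas 1-3 can be recomputed from the recurrence coefficients and the initial conditions $\lambda_L(1) = 4$, $\lambda_L(2) = 16 - |L|$. The sketch below solves the characteristic equation and the two initial-condition equations in closed form; the function name `bound_parameters` and the `LEMMAS` table are illustrative assumptions.

```python
import math


def bound_parameters(f1, f2, size_of_L):
    """Return (r, C, alpha, beta) of (20) for recurrence (17) with lambda(1)=4, lambda(2)=16-|L|."""
    lam1, lam2 = 4, 16 - size_of_L
    disc = math.sqrt(f1 * f1 + 4 * f2)
    r1, r2 = (f1 + disc) / 2, (f1 - disc) / 2      # roots of r^2 - f1*r - f2 = 0
    # Solve lam1 = C1*r1 + C2*r2 and lam2 = C1*r1^2 + C2*r2^2 for C1, C2.
    c1 = (lam2 - lam1 * r2) / (r1 * (r1 - r2))
    c2 = (lam2 - lam1 * r1) / (r2 * (r2 - r1))
    return r1, c1, abs(r2) / r1, abs(c2) / c1


# (f1, f2, |L|) as stated in Lemmas 1-3.
LEMMAS = {
    "Lemma 1": (4, -1, 1),
    "Lemma 2": (3, 2, 2),
    "Lemma 3": (2, 4, 4),
}

for name, args in LEMMAS.items():
    r, c, alpha, beta = bound_parameters(*args)
    print(name, round(r, 2), round(c, 2), round(alpha, 4), round(beta, 4))
```

In all three cases $C_2 < 0$ and $C_1 + C_2 = 1$, which corresponds to the formal value $\lambda_L(0) = 1$ obtained by running the recurrence backwards.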

Proof of Lemmas 1-3. Let $a, b \in \{A, C, G, T\}$ denote
arbitrary letters of the DNA alphabet and

$$[n, L]_a \triangleq \{x : x \in [n, L] \text{ and } x_n = a\},$$
$$[n, L]_{a,b} \triangleq \{x : x \in [n, L] \text{ and } x_{n-1} = a,\ x_n = b\}$$

denote the corresponding subsets of ensemble $[n, L]$. If a pair
$(a, b) \in L$, then subset $[n, L]_{a,b} = \emptyset$. Note that $[n, L]_a$ and
$[n, L]$ can be written as sums of non-intersecting subsets:

$$[n, L]_a = [n, L]_{A,a} + [n, L]_{C,a} + [n, L]_{G,a} + [n, L]_{T,a},$$
$$[n, L] = [n, L]_A + [n, L]_C + [n, L]_G + [n, L]_T. \qquad (24)$$

In addition, one can easily see the following two properties.
1) If for any $b \in \{A, C, G, T\}$, pair $(b, a) \notin L$, then the
cardinality

$$|[n, L]_a| = |[n - 1, L]| = \lambda_L(n - 1). \qquad (25)$$

2) For any pair $(a, b) \notin L$, the cardinality

$$|[n, L]_{a,b}| = |[n - 1, L]_a|. \qquad (26)$$

Let $L = \{TA\}$. In virtue of (24)-(26), we have

$$\lambda_L(n) = 3\lambda_L(n-1) + |[n,L]_{A,A}| + |[n,L]_{C,A}| + |[n,L]_{G,A}|$$
$$= 3\lambda_L(n-1) + |[n-1,L]_A| + |[n-1,L]_C| + |[n-1,L]_G|$$
$$= 3\lambda_L(n-1) + 2\lambda_L(n-2) + |[n-1,L]_A|$$

and

$$\lambda_L(n-1) = 3\lambda_L(n-2) + |[n-1,L]_A|.$$

Subtracting the second formula from the first yields the recurrent equation

$$\lambda_L(n) = 4\lambda_L(n-1) - \lambda_L(n-2), \qquad n = 3, 4, \ldots,$$

formulated in Lemma 1. Using similar arguments, one can
prove Lemma 2 for set $L = \{TA, AT\}$ and Lemma 3 for
set $L = \{TA, AT, AA, TT\}$.
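
The recurrences of Lemmas 1-3 can also be confirmed by exhaustive enumeration for small $n$. The following sketch counts the sequences of the ensemble directly; it is a brute-force check with assumed names (`ensemble_size`, `recurrence_holds`), not the counting argument of the proof.

```python
from itertools import product


def ensemble_size(n, L):
    """Count DNA sequences of length n that contain no 2-stem from L (brute force)."""
    count = 0
    for letters in product("ACGT", repeat=n):
        if all(a + b not in L for a, b in zip(letters, letters[1:])):
            count += 1
    return count


# (L, f1, f2) as stated in Lemmas 1-3.
CASES = [
    ({"TA"}, 4, -1),
    ({"TA", "AT"}, 3, 2),
    ({"TA", "AT", "AA", "TT"}, 2, 4),
]


def recurrence_holds(L, f1, f2, n_max=7):
    """Check lambda(n) = f1*lambda(n-1) + f2*lambda(n-2) for n = 3..n_max."""
    lam = {n: ensemble_size(n, L) for n in range(1, n_max + 1)}
    return all(lam[n] == f1 * lam[n - 1] + f2 * lam[n - 2] for n in range(3, n_max + 1))


for L, f1, f2 in CASES:
    print(sorted(L), [ensemble_size(n, L) for n in range(1, 5)], recurrence_holds(L, f1, f2))
```

The enumeration grows as $4^n$, so it is only meant as a sanity check for short lengths.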


B.  Random  Coding  Bound  for  Fibonacci  L-Ensemble 
Let

$$\rho_L \triangleq \log_4 r, \qquad \rho'_L \triangleq \log_4 \frac{r}{C^3 (1 + \beta\alpha^2)(1 + \beta\alpha)^2},$$

where $r = r(L)$, $C = C(L)$, $\alpha = \alpha(L)$ and $\beta = \beta(L)$ are
introduced in Propositions 4 and 5 and given by formulas (20).
For sets $L$ defined by (12), parameters (20) are calculated
by formulas (21)-(23). In Sect. IV, using a random coding
method [4], we present a brief proof of

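Theorem 2. For any distance fraction $d > 0$, the rate (13)
satisfies the inequality

$$R_L(d) \ge \underline{R}_L(d) \triangleq \min_{0 \le u \le d} \{(1 - u)\rho_L - E_L(u)\},$$

where

$$E_L(u) \triangleq \max_{0 \le v \le \min\{u,\, 1 - u\}} E_L(v, u),$$
$$E_L(v, u) \triangleq -\rho'_L \cdot v + (1 - u)\, h\!\left(\frac{v}{1 - u}\right) + 2u\, h\!\left(\frac{v}{u}\right),$$
$$h(u) \triangleq -u \log_4 u - (1 - u) \log_4 (1 - u).$$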


Let a number $d_L$, $0 < d_L < 1$, be the unique root of the
equation $\underline{R}_L(d) = 0$ or $(1 - d)\rho_L = E_L(d)$. Obviously, if
$0 < d < d_L$, then $R_L(d) > 0$ and the following lower bound

$$R_L(d) \ge \underline{R}_L(d) = (1 - d)\rho_L - E_L(d), \qquad 0 < d < d_L,$$

holds. Function $\underline{R}_L(d)$ is called a random coding bound on the
rate $R_L(d)$. We will say that the number $d_L$, $0 < d_L < 1$, is a
critical distance fraction of the random coding bound $\underline{R}_L(d)$
for the $DNA(n, L)$-ensemble.

For sets (12), our calculations based on Lemmas 1-3 give
the following numerical values for critical distance fractions:

$$d_L = \begin{cases} 0.4794 & \text{if } L = \emptyset, \\ 0.4316 & \text{if } L = \{TA\}, \\ 0.4054 & \text{if } L = \{TA, AT\}, \\ 0.3487 & \text{if } L = \{TA, AT, AA, TT\}. \end{cases} \qquad (27)$$
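
The critical distance fractions can be approximated numerically by solving $(1 - d)\rho_L = E_L(d)$. The sketch below is a hedged reconstruction rather than the authors' computation: it assumes $\rho'_L = \rho_L - \log_4[C^3(1+\beta\alpha^2)(1+\beta\alpha)^2]$ and $E_L(v,u) = -\rho'_L v + (1-u)\,h(v/(1-u)) + 2u\,h(v/u)$, uses a ternary search in $v$ (the objective is concave in $v$) and bisection in $d$. All function names are illustrative.

```python
import math


def h4(p):
    """Binary entropy with logarithms to base 4."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p, 4) + (1.0 - p) * math.log(1.0 - p, 4))


def parameters(f1, f2, size_of_L):
    """(r, C, alpha, beta) of (20) from recurrence (17), lambda(1)=4, lambda(2)=16-|L|."""
    lam1, lam2 = 4, 16 - size_of_L
    disc = math.sqrt(f1 * f1 + 4 * f2)
    r1, r2 = (f1 + disc) / 2, (f1 - disc) / 2
    c1 = (lam2 - lam1 * r2) / (r1 * (r1 - r2))
    c2 = (lam2 - lam1 * r1) / (r2 * (r2 - r1))
    return r1, c1, abs(r2) / r1, abs(c2) / c1


def exponent(u, rho_prime):
    """E_L(u): maximum over 0 <= v <= min(u, 1-u) of E_L(v, u), by ternary search."""
    def value(v):
        return -rho_prime * v + (1.0 - u) * h4(v / (1.0 - u)) + 2.0 * u * h4(v / u)
    lo, hi = 0.0, min(u, 1.0 - u)
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if value(m1) < value(m2):
            lo = m1
        else:
            hi = m2
    return value((lo + hi) / 2.0)


def critical_fraction(r, c, alpha, beta):
    """Root d of (1 - d) * rho = E(d), located by bisection on (0, 0.9)."""
    rho = math.log(r, 4)
    rho_prime = rho - math.log(c ** 3 * (1 + beta * alpha ** 2) * (1 + beta * alpha) ** 2, 4)
    lo, hi = 1e-6, 0.9
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (1.0 - mid) * rho - exponent(mid, rho_prime) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


PARAMS = [
    (4.0, 1.0, 0.0, 0.0),      # L empty (see the Remark after Proposition 5)
    parameters(4, -1, 1),      # L = {TA}
    parameters(3, 2, 2),       # L = {TA, AT}
    parameters(2, 4, 4),       # L = {TA, AT, AA, TT}
]
for p in PARAMS:
    print(round(critical_fraction(*p), 4))
```

The bisection relies on the stated uniqueness of the root in $(0, 1)$; the upper end 0.9 of the search interval is an arbitrary choice made here.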
C. Random Coding Bound for DNA (n, dn, w)-Codes

Let $R^{(w)}(d)$, $d > 0$, be the rate (11) of DNA $(n, dn, w)$-codes
and let $d_L$, $0 < d_L < 1$, be the critical distance fraction
of the random coding bound $\underline{R}_L(d)$ for the Fibonacci $L$-ensemble.
Proposition 3 and Theorem 2 lead to

Theorem 3. If $0 < d < d^{(w)} \triangleq \max_L \{w_L \cdot d_L\}$, then the
rate $R^{(w)}(d) > 0$ and the lower bound

$$R^{(w)}(d) \ge \underline{R}^{(w)}(d) \triangleq \max_{L} \underline{R}_L\!\left(\frac{d}{w_L}\right)$$

holds.

Function $\underline{R}^{(w)}(d)$ is called a random coding bound for DNA
$(n, dn, w)$-codes. The number $d^{(w)} > 0$ is called a critical
distance fraction of the random coding bound $\underline{R}^{(w)}(d)$. For
instance, if weight function $w = w(a,b)$ is defined by Table 1,
then for sets (12), numbers (15) and (27) give:

$$w_L \cdot d_L = \begin{cases} 0.29 & \text{if } L = \emptyset, \\ 0.38 & \text{if } L = \{TA\}, \\ 0.41 & \text{if } L = \{TA, AT\}, \\ 0.45 & \text{if } L = \{TA, AT, AA, TT\}. \end{cases}$$

Therefore, the corresponding critical distance fraction is
$d^{(w)} = \max_L \{w_L \cdot d_L\} = 0.45$.
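
The arithmetic behind these four products is a one-liner. The snippet below simply multiplies the values of $w_L$ and $d_L$ quoted in the text for the Table 1 weights; the dictionary keys are illustrative labels for the four sets $L$.

```python
# Values quoted in the text for the Table 1 weight function.
W_L = {"empty": 0.60, "TA": 0.88, "TA,AT": 1.02, "TA,AT,AA,TT": 1.29}
D_L = {"empty": 0.4794, "TA": 0.4316, "TA,AT": 0.4054, "TA,AT,AA,TT": 0.3487}

# Product w_L * d_L for each ensemble and the resulting critical distance fraction.
products = {name: W_L[name] * D_L[name] for name in W_L}
critical = max(products.values())

for name, value in products.items():
    print(name, round(value, 2))
print("critical distance fraction:", round(critical, 2))
```

Although $d_L$ decreases as $L$ grows, the increase of $w_L$ dominates, so the largest set $L$ gives the largest product.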

IV.  Proof  of  Theorem  2 

Let $S(x,y)$ be the 2-stem similarity (1) for the uniform weight
function. For an arbitrary integer $s \in [n - 1]$, define the set

$$\mathcal{P}_L(n, s) \triangleq \{(x, y) \in [n, L] \times [n, L] : S(x,y) = s\}.$$

Lemma 4. The size

$$|\mathcal{P}_L(n, s)| \le \sum_{j=1}^{\min\{s,\, n - s\}} r^{s+j} \binom{s - 1}{j - 1} [C(1 + \beta\alpha^2)]^j \times \left\{ r^{n-s-j} [C(1 + \beta\alpha)]^{j+1} \binom{n - s}{j} \right\}^2, \qquad (28)$$

where $r = r(L)$, $C = C(L)$, $\alpha = \alpha(L)$ and $\beta = \beta(L)$ were
introduced in the formulation of Theorem 2.

The random coding method of [4], Lemma 4 and an asymptotic
analysis of the right-hand side of (28) yield Theorem 2.
To complete the proof of Theorem 2, we give

Proof of Lemma 4. Consider a pair $(x, y) \in \{A, C, G, T\}^n \times \{A, C, G, T\}^n$ for
which $S(x,y) = s$. Then there exists $z \in Z(x,y)$, $|z| \le n$, and
the integer $j = k(z,x,y) \le |z|/2$ for which equalities

$$s = |z| - j \iff |z| = s + j \iff n - |z| = n - s - j$$

take place. It follows that for any $z \in Z(x,y)$, the number
$j = k(z,x,y)$ satisfies inequalities $1 \le j \le \min\{s;\, n - s\}$.

Obviously, the number of all ways to distribute $|z|$ indistinguishable
marbles in $j$ boxes provided that each of $j$ boxes
contains $\ge 2$ marbles is $\binom{s-1}{j-1}$. In addition, the number of all
ways to distribute $n - |z|$ indistinguishable marbles in $j + 1$
boxes if empty boxes are accepted is $\binom{n-s}{j}$.
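
The two counting claims can be checked by brute force for small parameters. The sketch below assumes the counts are $\binom{s-1}{j-1}$ for $|z| = s + j$ marbles in $j$ boxes with at least 2 per box, and $\binom{n-s}{j}$ for $n - s - j$ marbles in $j + 1$ boxes with empty boxes allowed; the helper name `count_distributions` is hypothetical.

```python
from itertools import product
from math import comb


def count_distributions(total, boxes, minimum):
    """Ordered ways to put `total` identical marbles into `boxes` boxes,
    each box holding at least `minimum` marbles (exhaustive enumeration)."""
    return sum(
        1
        for parts in product(range(minimum, total + 1), repeat=boxes)
        if sum(parts) == total
    )


n, s, j = 7, 3, 2
print(count_distributions(s + j, j, 2), comb(s - 1, j - 1))
print(count_distributions(n - s - j, j + 1, 0), comb(n - s, j))
```

Both identities follow from the standard stars-and-bars argument after removing the mandatory marbles from each box.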

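Let $1 \le j \le b \le n$ be fixed integers and let

$$\{b_\ell\} \triangleq (b_1, b_2, \ldots, b_\ell, \ldots, b_j), \qquad b_\ell \ge 1,$$

be an ordered collection of integers. For $m = 1, 2$, introduce
two sets

$$(j, b)_m \triangleq \left\{ \{b_\ell\} : \sum_{\ell=1}^{j} b_\ell = b,\ b_\ell \ge m \right\} \qquad (29)$$

and define numbers

$$\Lambda_m(j, b) \triangleq \max_{\{b_\ell\} \in (j, b)_m} \left\{ \prod_{\ell=1}^{j} \lambda_L(b_\ell) \right\}. \qquad (30)$$

Applying the above formulas and notations, one can see that
for any $s \in [n - 1]$, the cardinality

$$|\mathcal{P}_L(n, s)| \le \sum_{j=1}^{\min\{s;\, n - s\}} \Lambda_2(j, s + j) \binom{s - 1}{j - 1} \times \left\{ \Lambda_1(j + 1, n - s - j) \binom{n - s}{j} \right\}^2. \qquad (31)$$

From definitions (29)-(30) and upper bound (19) it follows that
for $m = 1, 2$,

$$\Lambda_m(j, b) \le \max_{(j, b)_m} \left\{ \prod_{\ell=1}^{j} \left[ C r^{b_\ell} (1 + \beta\alpha^{b_\ell}) \right] \right\} \le C^j r^b \max_{(j, b)_m} \left\{ \prod_{\ell=1}^{j} \left[ 1 + \beta\alpha^{b_\ell} \right] \right\} \le r^b \left[ C (1 + \beta\alpha^m) \right]^j.$$

These inequalities and (31) lead to (28).

Lemma 4 is proved.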